22 April 2020

Availability

Motivation: strain identification

Motivation: strain identification

  • Bacterial infections remain a major killer around the world.
  • Resistance emerging in many common pathogens
  • Whole-genome sequencing
    • Precision diagnostics
    • Epidemiological studies
    • \(< 20\) € per genome
  • … but isolating strains for sequencing is expensive.

One solution: plate sweep sequencing

  • Sequence all strains on a culture plate at once.
    • Less laboratory work: skip the isolation step.
    • Sequence everything at once.
    • More representative of the variation in the original sample.

Traditional data collection

Culture -> Isolate -> Culture -> Sequence -> Analyse

Plate sweep sequencing

Culture -> Sequence -> Analyse

One solution: plate sweep sequencing

  • Sequence all strains on a culture plate at once.
    • Less laboratory work: skip the isolation step.
    • Sequence everything at once.
    • More representative of the variation in the original sample.
  • Challenges
    • Strains are difficult to distinguish in mixed data.
    • Pipelines are designed for isolate sequencing data.
    • Unconventional: not (yet?) widely done.

Example: Staphylococcus aureus

The strain identification problem

Assume: known clusters \(c\) of bacterial strains and their reference sequences

Task: given sequencing reads \(r_n\), identify the underlying mixture of clusters and their proportions \(\theta\)

Single reads may be ambiguous and belong to multiple strains.

The strain identification problem

Assume: known clusters \(c\) of bacterial strains and their reference sequences

Task: given sequencing reads \(r_n\), identify the underlying mixture of clusters and their proportions \(\theta\)

Single reads may be ambiguous and belong to multiple strains.

The strain identification problem

Assume: known clusters \(c\) of bacterial strains and their reference sequences

Task: given sequencing reads \(r_n\), identify the underlying mixture of clusters and their proportions \(\theta\)

Single reads may be ambiguous and belong to multiple strains.

mSWEEP pipeline (Mäklin et al. 2020)

Pre-processing (one time operation):

  1. Obtain reference genomes (e.g. from ENA or NCBI)
  2. Cluster the reference genomes
  3. Prepare an index of the reference genomes.

Analysis (once for each sample):

  1. Pseudoalign the reads (Themisto)
  2. Estimate the mixing proportions (mSWEEP)

Example: Staphylococcus aureus

Experiment: true origin of isolate sequencing data

Setup: isolate sequencing data from multiple species:

  • Campylobacter jejuni and Campylobacter coli
  • Escherichia coli
  • Klebsiella pneumoniae
  • Staphylococcus epidermidis

Can we identify the true sequence type with mSWEEP?

  • Let’s see…

Results: true origin of isolate sequencing data

Experiment: synthetic mixtures

Setup: synthetically mix the isolate data from the previous slides together. How well do we identify the true source now?

Results: synthetic mixtures

Analysing clinical sequencing samples

  • As a practical example, let us look at some mixed sequencing data analyzed in Sankar et al. (Microbial genomics 2016)

  • We will analyse a small number of clinical sequencing samples of S. aureus that are suspected of contamination because corresponding assemblies are significantly longer than a typical S. aureus genome

  • To run this yourself, you will first need to install

mSWEEP pipeline (Mäklin et al. 2020)

Pre-processing (one time operation):

  1. Obtain reference genomes (e.g. from ENA or NCBI)
  2. Cluster the reference genomes
  3. Prepare an index of the reference genomes.

Analysis (once for each sample):

  1. Pseudoalign the reads (Themisto)
  2. Estimate the mixing proportions (mSWEEP)

Index preparation

Download and extract the S.aureus and Staphylococcus epidermidis sequence assemblies produced in Méric et al. (Genome Biology and Evolution 2015)

wget https://ndownloader.figshare.com/files/22366911 \
     -O staphylococcus_reference.tgz
tar -zxvf staphylococcus_reference.tgz

Run mlst to identify the clusters (sequence types)

mlst staphylococcus_reference/*.fna > staph_sts.tsv

Filter out sequences that can’t be assigned to a sequence type

grep -v "[[:blank:]]-[[:blank:]]" staph_sts.tsv > staph_sts_fil.tsv

Index preparation (cont.)

Concatenate the sequences by adding a 240 bp gap between contigs, and merge the sequences to a single fasta file

gap=$(printf 'N%.0s' {1..240})
while read file; do
    name=${file%.fna}
    name=${name##*/}
    name=">"$name
    echo $name >> reference.fasta
    sed "s/^>.*/$gap/g" $file | tr -d '\n' | fold >> reference.fasta
    echo "" >> reference.fasta
done < <(cut -f1 staph_sts_fil.tsv)

Extract the cluster indicators

cut -f2,3 staph_sts_fil.tsv | sed 's/[[:blank:]]/-ST/g' > clusts.txt

Exploring the reference

head reference.fasta
## >1076NL
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## CATGGAACTGATGAATATTATTTATATAAAAACTTTAGAAACGTAATAGAAATAGAAAATGAAATGATATATAACAAATA
## TTCACATGGTGAAATCATTGATGATGAATTTTATGGATATCAATTAGATTTAAGTCCTGAAATTTTAGAAGCTTATCAGA
## CATTTTGTAACTTAATTATTGATAAACAAACAGTACCTAGCTTAATTTACTATGATGAAGACGTAAGAGAAGCAGGGTTA
## AGATATTGTGAAGCGGTTGAAAATGAGCTATATAGATTACCTGATAATGAAAAAGTCATTGACAATAGTATATTAAATAT
## CCATAAACTTGGATTAGTGTATCACGGAAAAAATATCTACTTTACGTCGTTACATCCTCTTTTAATAAGATATGAGATAG
## AGAAGACAATGCAACTTAAAGAAAACAAAATTAATGAAAAAGTATTTAGAAAATATAACCCTATTGGTTTATTACCTTAT

Exploring the reference

grep '>' reference.fasta
## >1076NL
## >1347N
## >19N
## >59N
## >612N1
## >619N1
## >619N2
## >747N1
## >790N
## >803NLR2
## >93N
## >94N2B
## >966N
## >AGT24
## >BUG37
## >BUG46
## >COB20
## >CV13
## >CV20
## >CV28
## >CV45
## >DEN107
## >DEN110
## >DEN116
## >DEN161
## >DEN178
## >DEN185
## >DEN189
## >DEN19
## >DEN22
## >DEN61
## >DEN62
## >DEN69
## >DEN73
## >DEN76
## >GCA_000007645.1_ASM764v1_genomic
## >GCA_000009005.1_ASM900v1_genomic
## >GCA_000009585.1_ASM958v1_genomic
## >GCA_000009645.1_ASM964v1_genomic
## >GCA_000009665.1_ASM966v1_genomic
## >GCA_000010445.1_ASM1044v1_genomic
## >GCA_000010465.1_ASM1046v1_genomic
## >GCA_000011265.1_ASM1126v1_genomic
## >GCA_000011505.1_ASM1150v1_genomic
## >GCA_000011525.1_ASM1152v1_genomic
## >GCA_000011925.1_ASM1192v1_genomic
## >GCA_000012045.1_ASM1204v1_genomic
## >GCA_000013425.1_ASM1342v1_genomic
## >GCA_000013465.1_ASM1346v1_genomic
## >GCA_000016805.1_ASM1680v1_genomic
## >GCA_000017085.1_ASM1708v1_genomic
## >GCA_000017125.1_ASM1712v1_genomic
## >GCA_000024585.1_ASM2458v1_genomic
## >GCA_000025145.2_ASM2514v1_genomic
## >GCA_000027045.1_ASM2704v1_genomic
## >GCA_000144955.1_ASM14495v1_genomic
## >GCA_000145595.1_ASM14559v1_genomic
## >GCA_000146385.1_ASM14638v1_genomic
## >GCA_000149015.1_ASM14901v1_genomic
## >GCA_000153665.1_ASM15366v1_genomic
## >GCA_000159535.2_ASM15953v2_genomic
## >GCA_000159555.1_ASM15955v1_genomic
## >GCA_000159575.1_ASM15957v1_genomic
## >GCA_000160235.1_ASM16023v1_genomic
## >GCA_000160355.1_ASM16035v1_genomic
## >GCA_000160375.1_ASM16037v1_genomic
## >GCA_000160395.1_ASM16039v1_genomic
## >GCA_000160415.1_ASM16041v1_genomic
## >GCA_000162595.1_ASM16259v1_genomic
## >GCA_000162615.1_ASM16261v1_genomic
## >GCA_000162635.1_ASM16263v1_genomic
## >GCA_000162655.1_ASM16265v1_genomic
## >GCA_000162675.1_ASM16267v1_genomic
## >GCA_000162695.1_ASM16269v1_genomic
## >GCA_000162715.1_ASM16271v1_genomic
## >GCA_000162735.1_ASM16273v1_genomic
## >GCA_000162795.1_ASM16279v1_genomic
## >GCA_000162815.1_ASM16281v1_genomic
## >GCA_000162835.1_ASM16283v1_genomic
## >GCA_000162855.1_ASM16285v1_genomic
## >GCA_000163335.1_ASM16333v1_genomic
## >GCA_000163575.1_ASM16357v1_genomic
## >GCA_000164075.1_ASM16407v1_genomic
## >GCA_000164715.1_ASM16471v1_genomic
## >GCA_000171435.1_ASM17143v1_genomic
## >GCA_000171455.1_ASM17145v1_genomic
## >GCA_000172855.1_ASM17285v1_genomic
## >GCA_000174455.1_ASM17445v1_genomic
## >GCA_000174475.1_ASM17447v1_genomic
## >GCA_000174495.1_ASM17449v1_genomic
## >GCA_000174515.1_ASM17451v1_genomic
## >GCA_000174535.1_ASM17453v1_genomic
## >GCA_000174555.1_ASM17455v1_genomic
## >GCA_000174575.1_ASM17457v1_genomic
## >GCA_000174595.1_ASM17459v1_genomic
## >GCA_000174615.1_ASM17461v1_genomic
## >GCA_000174635.1_ASM17463v1_genomic
## >GCA_000174955.1_ASM17495v1_genomic
## >GCA_000175475.1_ASM17547v1_genomic
## >GCA_000175495.1_ASM17549v1_genomic
## >GCA_000175955.1_ASM17595v1_genomic
## >GCA_000176195.1_ASM17619v1_genomic
## >GCA_000177115.1_ASM17711v1_genomic
## >GCA_000177995.1_ASM17799v1_genomic
## >GCA_000178015.1_ASM17801v1_genomic
## >GCA_000178035.1_ASM17803v1_genomic
## >GCA_000180095.1_ASM18009v1_genomic
## >GCA_000180395.1_ASM18039v1_genomic
## >GCA_000187145.1_ASM18714v1_genomic
## >GCA_000187165.1_ASM18716v1_genomic
## >GCA_000189435.2_ASM18943v2_genomic
## >GCA_000189455.2_ASM18945v2_genomic
## >GCA_000193835.2_ASM19383v2_genomic
## >GCA_000193855.2_ASM19385v2_genomic
## >GCA_000193875.2_ASM19387v2_genomic
## >GCA_000204665.1_ASM20466v1_genomic
## >GCA_000205345.2_ASM20534v2_genomic
## >GCA_000205365.2_ASM20536v2_genomic
## >GCA_000205385.2_ASM20538v2_genomic
## >GCA_000205405.2_ASM20540v2_genomic
## >GCA_000205425.2_ASM20542v2_genomic
## >GCA_000208695.1_ASM20869v1_genomic
## >GCA_000210315.1_ASM21031v1_genomic
## >GCA_000212435.2_ASM21243v2_genomic
## >GCA_000215405.2_ASM21540v2_genomic
## >GCA_000215425.2_ASM21542v2_genomic
## >GCA_000221685.2_ASM22168v2_genomic
## >GCA_000221725.2_ASM22172v2_genomic
## >GCA_000221785.2_ASM22178v2_genomic
## >GCA_000221805.2_ASM22180v2_genomic
## >GCA_000221845.2_ASM22184v2_genomic
## >GCA_000237125.1_ASM23712v1_genomic
## >GCA_000237265.1_ASM23726v1_genomic
## >GCA_000239235.1_ASM23923v1_genomic
## >GCA_000239555.2_ASM23955v2_genomic
## >GCA_000239575.2_ASM23957v2_genomic
## >GCA_000239595.2_ASM23959v2_genomic
## >GCA_000239615.2_ASM23961v2_genomic
## >GCA_000239635.2_ASM23963v2_genomic
## >GCA_000239655.2_ASM23965v2_genomic
## >GCA_000239675.2_ASM23967v2_genomic
## >GCA_000242475.2_ASM24247v2_genomic
## >GCA_000242495.2_ASM24249v2_genomic
## >GCA_000242515.2_ASM24251v2_genomic
## >GCA_000242535.2_ASM24253v2_genomic
## >GCA_000242555.2_ASM24255v2_genomic
## >GCA_000242575.2_ASM24257v2_genomic
## >GCA_000245495.1_ASM24549v1_genomic
## >GCA_000245575.2_ASM24557v2_genomic
## >GCA_000245595.2_ASM24559v2_genomic
## >GCA_000245615.2_ASM24561v2_genomic
## >GCA_000245635.2_ASM24563v2_genomic
## >GCA_000245655.2_ASM24565v2_genomic
## >GCA_000245695.2_ASM24569v2_genomic
## >GCA_000247025.2_ASM24702v2_genomic
## >GCA_000247065.2_ASM24706v2_genomic
## >GCA_000247105.2_ASM24710v2_genomic
## >GCA_000247125.2_ASM24712v2_genomic
## >GCA_000247145.2_ASM24714v2_genomic
## >GCA_000247165.2_ASM24716v2_genomic
## >GCA_000247205.2_ASM24720v2_genomic
## >GCA_000247255.2_ASM24725v2_genomic
## >GCA_000247275.2_ASM24727v2_genomic
## >GCA_000247315.2_ASM24731v2_genomic
## >GCA_000247335.2_ASM24733v2_genomic
## >GCA_000247375.2_ASM24737v2_genomic
## >GCA_000247395.2_ASM24739v2_genomic
## >GCA_000247415.2_ASM24741v2_genomic
## >GCA_000248475.2_ASM24847v2_genomic
## >GCA_000248495.2_ASM24849v2_genomic
## >GCA_000248515.2_ASM24851v2_genomic
## >GCA_000248535.2_ASM24853v2_genomic
## >GCA_000248555.2_ASM24855v2_genomic
## >GCA_000248555.3_ASM24855v3_genomic
## >GCA_000248575.2_ASM24857v2_genomic
## >GCA_000248595.2_ASM24859v2_genomic
## >GCA_000248615.2_ASM24861v2_genomic
## >GCA_000248635.2_ASM24863v2_genomic
## >GCA_000248655.2_ASM24865v2_genomic
## >GCA_000248675.2_ASM24867v2_genomic
## >GCA_000248695.2_ASM24869v2_genomic
## >GCA_000248715.2_ASM24871v2_genomic
## >GCA_000248735.2_ASM24873v2_genomic
## >GCA_000248755.2_ASM24875v2_genomic
## >GCA_000248775.2_ASM24877v2_genomic
## >GCA_000248795.2_ASM24879v2_genomic
## >GCA_000248815.2_ASM24881v2_genomic
## >GCA_000248835.2_ASM24883v2_genomic
## >GCA_000248855.2_ASM24885v2_genomic
## >GCA_000248875.2_ASM24887v2_genomic
## >GCA_000248895.2_ASM24889v2_genomic
## >GCA_000248915.2_ASM24891v2_genomic
## >GCA_000248935.2_ASM24893v2_genomic
## >GCA_000248955.2_ASM24895v2_genomic
## >GCA_000248975.2_ASM24897v2_genomic
## >GCA_000248995.2_ASM24899v2_genomic
## >GCA_000249015.2_ASM24901v2_genomic
## >GCA_000249035.2_ASM24903v2_genomic
## >GCA_000252405.2_ASM25240v2_genomic
## >GCA_000253135.1_ASM25313v1_genomic
## >GCA_000257965.1_SaureusISKv1.0_genomic
## >GCA_000257985.1_SaureusISMv1.0_genomic
## >GCA_000258685.1_ASM25868v1_genomic
## >GCA_000259715.1_S_aureus_118_1_genomic
## >GCA_000259995.1_ASM25999v1_genomic
## >GCA_000260015.1_SaureusIS157v1.0_genomic
## >GCA_000260295.1_ASM26029v1_genomic
## >GCA_000260315.1_ASM26031v1_genomic
## >GCA_000262815.1_VRSA01_3_genomic
## >GCA_000262835.1_VRSA02_3_genomic
## >GCA_000262855.1_VRSA03_3_genomic
## >GCA_000262875.1_VRSA04_3_genomic
## >GCA_000262895.1_VRSA05_2_genomic
## >GCA_000262915.1_VRSA06_2_genomic
## >GCA_000262935.1_VRSA07_3_genomic
## >GCA_000262955.1_VRSA08_2_genomic
## >GCA_000262975.1_VRSA09_2_genomic
## >GCA_000262995.2_VRSA10_2_genomic
## >GCA_000263015.1_VRSA11a_2_genomic
## >GCA_000263035.1_VRSA11b_2_genomic
## >GCA_000275965.1_ASM27596v1_genomic
## >GCA_000275985.2_ASM27598v2_genomic
## >GCA_000276005.1_ASM27600v1_genomic
## >GCA_000276025.1_ASM27602v1_genomic
## >GCA_000276045.1_ASM27604v1_genomic
## >GCA_000276065.1_ASM27606v1_genomic
## >GCA_000276085.1_ASM27608v1_genomic
## >GCA_000276105.1_ASM27610v1_genomic
## >GCA_000276125.1_ASM27612v1_genomic
## >GCA_000276145.1_ASM27614v1_genomic
## >GCA_000276165.1_ASM27616v1_genomic
## >GCA_000276185.1_ASM27618v1_genomic
## >GCA_000276205.1_ASM27620v1_genomic
## >GCA_000276225.1_ASM27622v1_genomic
## >GCA_000276245.1_ASM27624v1_genomic
## >GCA_000276265.1_ASM27626v1_genomic
## >GCA_000276285.1_ASM27628v1_genomic
## >GCA_000276305.1_ASM27630v1_genomic
## >GCA_000276325.1_ASM27632v1_genomic
## >GCA_000276345.1_ASM27634v1_genomic
## >GCA_000276365.1_ASM27636v1_genomic
## >GCA_000276385.1_ASM27638v1_genomic
## >GCA_000276405.1_ASM27640v1_genomic
## >GCA_000276425.1_ASM27642v1_genomic
## >GCA_000276445.1_ASM27644v1_genomic
## >GCA_000276465.1_ASM27646v1_genomic
## >GCA_000276485.1_ASM27648v1_genomic
## >GCA_000276505.1_ASM27650v1_genomic
## >GCA_000276525.1_ASM27652v1_genomic
## >GCA_000276545.1_ASM27654v1_genomic
## >GCA_000276625.1_Newbould305_assembly_v1_genomic
## >GCA_000280745.1_S.aureus_GR1_1.0_genomic
## >GCA_000284535.1_ASM28453v1_genomic
## >GCA_000284895.1_ASM28489v1_genomic
## >GCA_000294385.1_ASM29438v1_genomic
## >GCA_000294385.2_ASM29438v2_genomic
## >GCA_000296595.1_ASM29659v1_genomic
## >GRE34
## >GRE41
## >HFA6096
## >HFA6162
## >HFA6181
## >HFA6226
## >HFA6228
## >HFA6286
## >HFA6391
## >HUR105
## >ICE102
## >ICE122
## >ICE18
## >ICE192
## >ICE24
## >ICE26
## >ICE7
## >ICE87
## >ICE9
## >ICE95
## >ICE99
## >ITL299
## >ITL34
## >JAP271
## >MCO150
## >MEX35
## >MEX60
## >MEX61
## >TAW113
## >TAW60
## >URU23

Exploring the reference

cat clusts.txt
## sepidermidis-ST307
## sepidermidis-ST2
## sepidermidis-ST788
## sepidermidis-ST212
## sepidermidis-ST556
## sepidermidis-ST184
## sepidermidis-ST184
## sepidermidis-ST171
## sepidermidis-ST719
## sepidermidis-ST595
## sepidermidis-ST329
## sepidermidis-ST328
## sepidermidis-ST35
## sepidermidis-ST23
## sepidermidis-ST19
## sepidermidis-ST189
## sepidermidis-ST51
## sepidermidis-ST369
## sepidermidis-ST72
## sepidermidis-ST44
## sepidermidis-ST23
## sepidermidis-ST40
## sepidermidis-ST66
## sepidermidis-ST215
## sepidermidis-ST35
## sepidermidis-ST5
## sepidermidis-ST21
## sepidermidis-ST55
## sepidermidis-ST1
## sepidermidis-ST4
## sepidermidis-ST22
## sepidermidis-ST230
## sepidermidis-ST56
## sepidermidis-ST21
## sepidermidis-ST14
## sepidermidis-ST8
## saureus-ST151
## saureus-ST398
## saureus-ST5
## saureus-ST5
## saureus-ST5
## saureus-ST254
## saureus-ST1
## saureus-ST36
## saureus-ST1
## sepidermidis-ST10
## saureus-ST250
## saureus-ST8
## saureus-ST8
## saureus-ST105
## saureus-ST8
## saureus-ST105
## saureus-ST5
## saureus-ST225
## saureus-ST239
## saureus-ST93
## saureus-ST239
## saureus-ST239
## saureus-ST1
## saureus-ST1159
## saureus-ST4618
## saureus-ST72
## sepidermidis-ST59
## sepidermidis-ST290
## saureus-ST30
## saureus-ST30
## saureus-ST30
## saureus-ST30
## saureus-ST30
## saureus-ST42
## saureus-ST145
## saureus-ST10
## saureus-ST30
## saureus-ST30
## saureus-ST30
## saureus-ST30
## saureus-ST431
## saureus-ST30
## saureus-ST30
## saureus-ST34
## saureus-ST30
## saureus-ST36
## sepidermidis-ST5
## saureus-ST1
## saureus-ST8
## saureus-ST8
## saureus-ST239
## saureus-ST5
## saureus-ST8
## saureus-ST5
## saureus-ST5
## saureus-ST5
## saureus-ST5
## saureus-ST1892
## saureus-ST5
## saureus-ST5
## saureus-ST5
## saureus-ST8
## saureus-ST8
## saureus-ST5
## saureus-ST5
## saureus-ST5
## sepidermidis-ST57
## saureus-ST8
## saureus-ST105
## saureus-ST105
## saureus-ST5
## saureus-ST5
## saureus-ST923
## saureus-ST8
## saureus-ST700
## saureus-ST2490
## saureus-ST30
## saureus-ST8
## saureus-ST5
## saureus-ST239
## saureus-ST8
## saureus-ST5
## saureus-ST25
## sepidermidis-ST1
## sepidermidis-ST5
## saureus-ST239
## saureus-ST133
## saureus-ST8
## saureus-ST25
## saureus-ST22
## sepidermidis-ST73
## saureus-ST707
## saureus-ST72
## saureus-ST4166
## saureus-ST30
## saureus-ST59
## saureus-ST425
## saureus-ST80
## saureus-ST398
## saureus-ST109
## saureus-ST188
## saureus-ST8
## saureus-ST372
## saureus-ST395
## saureus-ST507
## saureus-ST45
## saureus-ST49
## saureus-ST30
## saureus-ST1447
## saureus-ST254
## saureus-ST80
## saureus-ST8
## saureus-ST34
## saureus-ST88
## saureus-ST30
## sepidermidis-ST2
## sepidermidis-ST8
## sepidermidis-ST1
## sepidermidis-ST23
## sepidermidis-ST22
## sepidermidis-ST73
## sepidermidis-ST384
## sepidermidis-ST23
## sepidermidis-ST136
## sepidermidis-ST329
## saureus-ST8
## saureus-ST5
## saureus-ST1356
## saureus-ST8
## saureus-ST22
## saureus-ST4803
## saureus-ST4166
## saureus-ST5
## saureus-ST8
## saureus-ST5
## saureus-ST346
## saureus-ST36
## saureus-ST36
## saureus-ST36
## saureus-ST36
## saureus-ST5
## saureus-ST5
## saureus-ST1
## saureus-ST8
## saureus-ST5
## saureus-ST5
## saureus-ST36
## saureus-ST5
## saureus-ST36
## saureus-ST8
## saureus-ST664
## saureus-ST609
## saureus-ST45
## saureus-ST346
## saureus-ST36
## saureus-ST45
## saureus-ST634
## saureus-ST36
## saureus-ST8
## saureus-ST105
## saureus-ST1
## saureus-ST15
## saureus-ST398
## saureus-ST5
## sepidermidis-ST130
## saureus-ST4166
## saureus-ST398
## saureus-ST772
## saureus-ST239
## saureus-ST239
## sepidermidis-ST19
## saureus-ST5510
## saureus-ST371
## saureus-ST5
## saureus-ST5
## saureus-ST4166
## saureus-ST231
## saureus-ST85
## saureus-ST231
## saureus-ST5
## saureus-ST5
## saureus-ST5
## saureus-ST5
## saureus-ST5
## sepidermidis-ST89
## sepidermidis-ST185
## sepidermidis-ST430
## sepidermidis-ST7
## sepidermidis-ST5
## sepidermidis-ST5
## sepidermidis-ST2
## sepidermidis-ST2
## sepidermidis-ST2
## sepidermidis-ST323
## sepidermidis-ST218
## sepidermidis-ST324
## sepidermidis-ST325
## sepidermidis-ST20
## sepidermidis-ST326
## sepidermidis-ST7
## sepidermidis-ST327
## sepidermidis-ST328
## sepidermidis-ST329
## sepidermidis-ST73
## sepidermidis-ST170
## sepidermidis-ST330
## sepidermidis-ST331
## sepidermidis-ST331
## sepidermidis-ST332
## sepidermidis-ST333
## sepidermidis-ST334
## sepidermidis-ST72
## sepidermidis-ST184
## sepidermidis-ST72
## saureus-ST115
## saureus-ST1692
## saureus-ST22
## saureus-ST4803
## saureus-ST5
## saureus-ST5
## saureus-ST398
## sepidermidis-ST69
## sepidermidis-ST83
## sepidermidis-ST184
## sepidermidis-ST57
## sepidermidis-ST17
## sepidermidis-ST60
## sepidermidis-ST22
## sepidermidis-ST88
## sepidermidis-ST73
## sepidermidis-ST23
## sepidermidis-ST21
## sepidermidis-ST215
## sepidermidis-ST1
## sepidermidis-ST5
## sepidermidis-ST38
## sepidermidis-ST248
## sepidermidis-ST2
## sepidermidis-ST40
## sepidermidis-ST6
## sepidermidis-ST10
## sepidermidis-ST20
## sepidermidis-ST32
## sepidermidis-ST66
## sepidermidis-ST781
## sepidermidis-ST46
## sepidermidis-ST559
## sepidermidis-ST61
## sepidermidis-ST61
## sepidermidis-ST85
## sepidermidis-ST59
## sepidermidis-ST86

Building an index

Run the ‘build_index’ command from Themisto

mkdir reference_index
mkdir reference_index/tmp
build_index --k 31 --input-file reference.fasta --auto-colors \
        --index-dir reference_index --temp-dir reference_index/tmp \
        --mem-megas 4192 --n-threads 2

Downloading sequencing data from ENA

Running the analysis (one sample)

Run the ‘pseudoalign’ command from Themisto

pseudoalign --query-file ERR033686_1.fastq.gz --outfile \
        ERR033686_1.aln --index-dir reference_index \
        --temp-dir reference_index/tmp --rc \
        --n-threads 2 --gzip-output --sort-output
pseudoalign --query-file ERR033686_2.fastq.gz --outfile \
        ERR033686_2.aln --index-dir reference_index \
        --temp-dir reference_index/tmp --rc \
        --n-threads 2 --gzip-output --sort-output

Use mSWEEP to estimate the abundances of the reference clusters

mSWEEP --themisto-1 ERR033686_1.aln.gz --themisto-2 ERR033686_2.aln.gz \
       --themisto-index reference_index -i clusts.txt -o ERR033686

Looking at the results

sed -n '1,3p' ERR033686_abundances.txt
sort -gk2 ERR033686_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0
## #total_hits: 2028476
## #c_id    mean_theta
## saureus-ST5  0.600885
## saureus-ST398    0.134503
## saureus-ST4166   0.0923787
## saureus-ST22 0.0644022
## saureus-ST1159   0.0468567
## saureus-ST25 0.0225847
## saureus-ST4803   0.00864433
## saureus-ST34 0.00854929
## saureus-ST72 0.00632708
## saureus-ST1892   0.00555749

Running the analysis (multiple samples)

SAMPLES="ERR033658 ERR033686 ERR038357 ERR038367 ERR038366"

for sample in $SAMPLES ; do
  ./get_reads.sh $sample
  pseudoalign --query-file $sample""_1.fastq.gz --outfile \
      $sample""_1.aln --index-dir reference_index --temp-dir \
      reference_index/tmp --rc --n-threads 2 --gzip-output --sort-output
  pseudoalign --query-file $sample""_2.fastq.gz --outfile \
      $sample""_2.aln --index-dir reference_index --temp-dir \
      reference_index/tmp --rc --n-threads 2 --gzip-output --sort-output
  mSWEEP --themisto-1 $sample""_1.aln.gz --themisto-2 $sample""_2.aln.gz \
      --themisto-index reference_index -i clusts.txt -o $sample
done

Looking at the results

sed -n '1,3p' ERR033658_abundances.txt
sort -gk2 ERR033658_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0
## #total_hits: 1693855
## #c_id    mean_theta
## saureus-ST634    0.428317
## saureus-ST5  0.230168
## saureus-ST105    0.132684
## saureus-ST225    0.116941
## saureus-ST4166   0.0778451
## saureus-ST85 0.00882182
## saureus-ST250    0.00252122
## saureus-ST1159   0.00251821
## saureus-ST346    2.11807e-05
## saureus-ST36 1.75996e-05

Looking at the results

sed -n '1,3p' ERR033686_abundances.txt
sort -gk2 ERR033686_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0
## #total_hits: 2028476
## #c_id    mean_theta
## saureus-ST5  0.600885
## saureus-ST398    0.134503
## saureus-ST4166   0.0923787
## saureus-ST22 0.0644022
## saureus-ST1159   0.0468567
## saureus-ST25 0.0225847
## saureus-ST4803   0.00864433
## saureus-ST34 0.00854929
## saureus-ST72 0.00632708
## saureus-ST1892   0.00555749

Looking at the results

sed -n '1,3p' ERR038357_abundances.txt
sort -gk2 ERR038357_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0
## #total_hits: 836224
## #c_id    mean_theta
## saureus-ST22 0.999946
## saureus-ST1  1.76348e-06
## saureus-ST4803   1.61217e-06
## saureus-ST10 1.53883e-06
## saureus-ST923    1.53446e-06
## saureus-ST72 1.2413e-06
## saureus-ST8  1.2412e-06
## saureus-ST772    1.17587e-06
## saureus-ST115    1.17585e-06
## saureus-ST254    1.13616e-06

Looking at the results

sed -n '1,3p' ERR038367_abundances.txt
sort -gk2 ERR038367_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0
## #total_hits: 625499
## #c_id    mean_theta
## saureus-ST5  0.573422
## saureus-ST1447   0.233878
## saureus-ST105    0.114883
## saureus-ST4166   0.0720786
## saureus-ST634    0.0050561
## saureus-ST85 0.000289054
## saureus-ST371    6.37372e-05
## saureus-ST231    5.41944e-05
## saureus-ST225    5.36303e-05
## saureus-ST45 1.29197e-05

Looking at the results

sed -n '1,3p' ERR038366_abundances.txt
sort -gk2 ERR038366_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0
## #total_hits: 751086
## #c_id    mean_theta
## saureus-ST1892   0.803113
## saureus-ST45 0.188378
## saureus-ST1159   0.00301731
## saureus-ST254    0.00199654
## saureus-ST1  0.00163599
## saureus-ST115    0.000933752
## saureus-ST88 0.000817574
## saureus-ST188    1.67652e-05
## saureus-ST25 1.45345e-05
## saureus-ST395    1.01139e-05

Analysis

tol15rainbow=c("#114477", "#4477AA", "#77AADD", "#117755", "#44AA88",
               "#99CCBB", "#777711", "#AAAA44", "#DDDD77", "#771111",
               "#AA4444", "#DD7777", "#771144", "#AA4477", "#DD77AA")

mycols <- c(tol15rainbow[1:13], "#000000", "#555555", "#AAAAAA")

ids <- c("ERR033658", "ERR033686", "ERR038357", "ERR038366", "ERR038367")
files <- paste(ids, "_abundances.txt", sep="")
clusters <- read.table(files[1], stringsAsFactors=FALSE)[, 1]

abundances <- matrix(0, nrow = length(files), ncol = length(clusters))
rownames(abundances) <- ids
colnames(abundances) <- clusters
for (i in 1:length(files)) {
    abundances[i, ] <- read.table(files[i])[, 2]
}

Analysis (cont.)

Further considerations: assembly

Most analyses of interest require either sequence assemblies or isolate sequencing data.

  • SNP calling
  • Phylogenetic inference
  • Transcript expression estimation
  • et cetera

We’ve recently developed a continuation of mSWEEP, called mGEMS, which separates the mixed sequencing data and enables these analyses.

Evaluating mGEMS

Experiment setup: synthetic mixtures of E. coli, E. faecium, and S. aureus strains.

Results of SNP calling from synthetic mixtures vs. isolate sequencing data

Evaluating mGEMS

Experiment: synthetic mixtures of E. coli ST131 strains. Phylogeny from isolates (left) and mixtures (right).

Try it yourself

A tutorial for using mGEMS to reproduce the E. coli phylogenetic analyses is on GitHub

(Extra) Adapt the instructions from the mGEMS tutorial to the S. aureus data we analysed during the lecture.

  • Can you apply mGEMS to the data analyzed during the lecture and assemble the genomes contained in the mixed samples?

Conclusion

  • Fast and accurate identification and quantification of bacterial strains from mixed sequencing data.
  • Nearly perfect accuracy for species with clonal population structures (e.g. E. coli, K. pneumoniae)
  • Assemblies from mixed sequencing data can be obtained using mGEMS, with accuracy in downstream analyses resembling the results of using isolate sequencing data.

If you try the methods and run into problems, you can reach me at tommi.maklin@helsinki.fi.

We’re also happy to hear any feedback on how to make our methods easier to use.